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Abstract 

The Cancer Genome Atlas (TCGA) provides researchers with clinicopathological data and genomic char- 
acterizations of various carcinomas. These data sets include expression microarrays for genes and microR- 
NAs - short, non-coding strands of RNA that downregulate gene expression through RNA interference 
- as well as days_to_death and days_to_last_f ollowup fields for each tumor sample. Our aim is to 
develop a software tool that screens TCGA data sets for genes/miRNAs with functional involvement in 
specific cancers. Furthermore, our computational pipeline is intended to produce a set of visualizations, 
or profiles, that place our screened outputs in a pathway-centric context. 

We accomplish our 'screening' by ranking genes/miRNAs by the correlation of their expression misregu- 
lation with differential patient survival. In other words, if a gene/miRNA is consistently misregulated in 
patients with poor survival rates and, on the other hand, is expressed more 'normally' in patients with 
longer survival rates, then it is ranked highly; if its misregulation has no such correlation with good/bad 
survival in patients, then its rank is low. Our pathway profiling pipeline produces several outputs, which 
allow us to examine the functional roles played by highly ranked genes discovered by our screening. 

Running the OV (ovarian serous cystadenocarcinoma) data set through our analysis pipeline, we find 
that several highly ranked pathways and functional groups of genes (VEGF, Jun, Fos, etc.) have already 
been shown to play some part in the development of epithelial ovarian carcinomas. We also observe that 
several top-ranking miRNAs target oncogenes in the top two quartiles of our rank-ordered list, implying 
that our ranking scheme is sound in principle and effectively sorts genes/miRNAs by their functional 
involvement in cancer. 

Our outputs suggest that the dysfunction of the Wnt signaling pathway, which regulates cell-fate specifi- 
cation and progenitor cell differentiation, has a disproportionate impact on the survival of ovarian cancer 
patients. This work has immediate implications: (1) efficient cancer diagnostics with microarrays for 
highly ranked genes/miRNAs in Wnt, and (2) novel drug treatments that target Wnt as well as other 
highly ranked pathways and functional groups of genes. It also motivates the improvement of MiRank to 
take into consideration the network of interactions between gene products and the optimization of node 
selection by cancer, a disease that evolves to selectively misregulate certain genes in order to lead a cell 
through tumorigenesis. 
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1 Introduction 



The use of computational tools to elucidate the mechanisms underlying the genetic, misregulatory aspect of 
cancer is intended to relieve the often tedious efforts made by wet-lab biologists to identify 'significant' genes 
and microRNAs by knock-out analysis. The goal of our pipeline is to allow researchers to screen cancer data 
sets for genes and mlRNAs that are more likely to experimentally validate, leaving more time for investigation 
into molecular mechanisms less amenable to quantitative, bioinformatics analysis. 

Most scripts developed for computational research in cancer biology are designed to predict clinicopathological 
features of tumors based on gene expression profiles. Such profiles are typically optimized to identify sets 
of genes whose expression is highly correlated with patient survival, tumor grade, chemotherapy sensitivity, 
etc. 

We postulate that the genes in expression profiles optimized for survival prediction in cancer 
patients are functionally implicated in the disease. Similarly, we propose thai ranking genes by 
the correlation of their expression misregulation with differential patient survival will result in an 
ordering that reflects the extent of each gene 's functional role in a specific cancer. 

Instead of following the canonical approach of identifying sets of genes whose expression can separate pa- 
tients into two groups whose survivorship distributions arc maximally 'different' (as measured by the log-rank 
statistic), our ranking algorithms take a gcnc-by-gene approach to measuring expression-survival correlation: 

• MiRank-A follows a multi-step, non-traditional process that outputs a ranking metric T (for each 
gcnc/mlRNA) that measures the correlation of differential expression with differential patient survival. 

• MiRank-B is essentially MiRank-A, backwards. While MiRank-A orders patients by differential expres- 
sion, then examines the resultant distribution of survival (across the ordered set of patients) , MiRank-B 
first stratifies patients by differential survival then examines the resultant distribution of differential 
expression. An expression-survival correlation metric M follows from this analysis. 

• MiRank-C is simply a univariate Cox regression (with expression misregulation values as covariates) 
that generates a set of regression coefficients, /3, that can be used to rank genes/miRNAs by our 
expression-survival correlation scheme. 

Our ranking scheme can also be extended to the world of microRNAs, short non-coding sequences of RNA 
that downregulate gene expression through RNA interference. In RNAi, one miRNA can target multiple 
genes. As such, we are able to examine the relationship between highly ranked miRNAs and highly ranked 
genes. We are especially interested in showing that the bipartite network of top-ranking genes and miRNAs 
is densely /sparsely connected, which reflects the efficacy of our ranking pipeline. 

To justify our use of novel ranking methods, we compare the predictive power of the prognostic indices 
constructed by Yoshihara et. al. through 'traditional' expression profile-based methods to that of an index 
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generated by our pipeline outputs. Wc do so by performing a Cox regression on a rank-optimized, 260-gene 
prognostic index that is defined as follows. 



260 



Ik = ^ Pi ■ X, 



ik 



(1) 



where Xik is the expression misregulation metric of the ith ranked gene for the kth patient, and Pi is the 
corresponding regression coefBcient. The regression delivers a hazard ratio that measures the correlation 
between patient survival and our prognostic index. 

Though gene/miRNA ranking is sufficient for our specified goal of identifying genes and miRNAs of interest 
to researchers, we can do more. Genes do not act in isolation, but rather through regulatory networks of gene- 
product interactions. As such, rank data is more useful to biologists when placed in a pathway-centric context. 

We accomplish this contcxtualization by (1) annotating KEGG Pathways gene-product wiring diagrams with 
color gradients that reflect the distribution of gone ranks within a pathway, (2) ranking pathways by their 
proportion of 'high' gene ranks, and (3) showing the enrichment of certain gene ontology (GO) terms in 
highly/poorly ranked functional groups of genes. Thus, we elucidate the functional roles played by highly 
ranked genes and miRNAs in tumorigenesis, which encapsulates excessive cell proliferation, migration, an- 
giogenesis, etc. 

The homebrew software packages that implement our analysis pipelines were written in Java, developed and 
tested on Mac OS X, and are freely available at http : //code . google . com/p/miranktool. We have included 
sample inputs and outputs to illustrate the nature of our scripts. 

2 Methods 

2.1 Expression Misregulation Metric X 

We propose the metric X to measure gene/miRNA expression misregulation. X is defined as follows. 



where Ci is the set of expression- values (across patients) for the ith gene/miRNA in OV (cancer sample set), 
Ci is the mean of set Ci, Ni is the set of expression- values (across patients) for the ith gene/miRNA in NC 
(normal control sample set), and Ni is the mean of set iVj. / is defined as follows. 



Xi = \og,o{\ f{Ci,Ni)\) 



(2) 




■ x>y 

:x<y 
:x = 0\/y = 



(3) 
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2.2 MiRank-A 



2.2.1 Survival Spreads 

We begin by ordering the patients in the input TCGA data set by their expression of gene/miRNA X (from 
least to greatest expression- value). 



Order patients biy 




expresikm-vaiue of 




gene/miFlNA urxler 




considsralim 





TCGA-XX-XXXX 
m\n. exp. val. 



TCGA-XX-XXXX 
max. exp. val. 



"bonom" 



"top" 



After ordering, we take iterative "slices," or groupings of patients, according to the following scheme. 



Take iterattve sices of psiti«rrt$ in OV 



Group 2 <- "bottom" 20, 
Group 1 <- "top" 20 



Group 2 <- 'tottom" 40, 
Group 1 <- "top" 20 



Group 2 <- 'tottom" 20, 
Group 1 <- "top" 40 



Group 2 <- "bottom" 40, 
Group 1 <- "top" 40 



Group 2 <■ "Ijollom" |0V| - 20, 
Group 1 <- "top" 20 



Group 2 <- 'tottom" 20, 
Group 1 <- topf' |OV| 



Group 2 <- "bottom" 40, 
Group 1 <- "lop" |OV| - 20 



Group 2 <■ "Ijollom" |0V| - 20, 
Group 1 <- "top" 40 



Group 2 <■ "bottom" |0V|, Group 1 <■ "top" 20 



We now generate a survival "spread" for each grouping. These "spreads" encapsulate the data necessary to 
calculate the log-rank statistic Z for each grouping. The log-rank statistic, also known as the Mantel-Cox 
test statistic, measures the similarity of the survival distributions of two patient groups (group 1 and group 
2 in figure 1.2). 
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Generate sjrvival "spread" for each grouping (example below for 'bottom 20 vs. top 160') 



Interval = 270 

figures in corresponding row are recorded using data from 

patients for wtiom 

180 < nna>!(days_tQ_death, days_tQ_last_followup) <^ 270 



nurn_patients_alive in succeeding row is calculated by 
subtracting the number of right-censored events {deaths 
ajid last foil owups') from num_patients_aJive m preceding 
rciiw 



After iterating this process across the set of genes/miRNAs X, we are left with the set of survival spreads S. 
2.2.2 Log-Rank Matrices 

Let us consider the survival spread S, which corresponds to a certain grouping of patients according to their 
expression of gene/miRNA X. With the data encapsulated by 5, we may calculate Z, the log-rank statistic 
that measures the similarity between the survival distributions of groups 1 and 2. Z is defined as follows. 







where Onj is the number of right-censored events occurring in group n at time j, Enj is the expected number 
of right-censored events occurring in group n at time j, and Vj is the variance of the Oj distribution (where 
Oj = Oij + 02j)- Enj is defined as follows 



En,=0,^ (5) 



where N^j is the number of patients in group n who have not yet died or been right-censored by time j, and 
Nj = Nij + N2j ■ Vj is defined as follows 

0,(§^)(l - ^)(N, - O,) 



After calculating the log-rank statistic for each survival spread, we compile the Z values corresponding to 
the survival spreads associated with the gcnc/miRNA X into a matrix. 
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Generate log-rank statistic for each survival spread 




Column labeis correspond to the size of the "top" group. 
Row tf^els correspond to the size of the "bottom" group. 



After iterating this process across the set of genes/miRNAs X, we are left with the set of log-rank matrices 
M. 

2.2.3 Rank Spread 

From our set of log-rank matrices M, we may generate the following "rank spread." 



gene 




grouping style Ctx>ttom;top) 


min 


grouplr»g style (bottom; top) 


mear^ 


meclan 


>1.96 


s2.5a 


r^PO 


2.6&4I4 


40; 330 


0.0027628357 


90; 20 


1.0409412 


1.0260323 


121 


4 


NOSM 


2.6279283 


110; 330 


0.0029707423 


30; 410 


1.4576112 


1.5569885 


333 


13 


nox.1 


2.12043S3 


20; 30 


3.0830943E-4 


110; 30 


0.7022139 


0.60062206 


3 







2.2322666 


30; 30 


9. 387381 3t- 4 


400; 190 


0.90940813 


0.96838096 
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3.2343083 


60; 370 


0.18381436 


320; 70 


1.713567 


1.7367335 


663 


133 


noxfs 


2.3433734 


110; 30 


0.0014420918 


440; 70 


0.8317974 


0.831254 


28 





NR2F2 


2.142266 


110; 480 


0.0011862807 


220; 270 


0.6040009 


0.55641484 


4 





SODl 


2.0319863 


30; 140 


1.1023997E-4 


110; 210 


0.6683362 


0.5379441 


4 





S0D2 


1.7361692 


420; 90 


4.99713E-4 


20; 150 


0.62343413 


0.60242346 








S0D3 


1.367376 


300; 90 


6.001332E-4 


160; 330 


0.44674462 


0.37339982 









where >1 . 96 represents the number of log-rank statistics in the matrix (corresponding to a certain gene/miRNA) 
whose values are greater than 1.96 (likewise for >2.58). Note that when Z > 1.96, p < 0.05, and that when 
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Z > 2.58, p < 0.01. 

2.2.4 Expression-Survival Correlation Statistic T 

We propose the metric Ti to measure the correlation of patient survival with the expression of the ith 
gene/miRNA. Tj is defined as follows. 

Ti = max{Zi) ■ median{Zi) (7) 

where max{Zi) is the maximum log-rank statistic found in the log-rank matrix of the ith gene/miRNA, and 
median{Zi) is the median of the log-rank statistics found in the log-rank matrix of the ith gene/miRNA. 

2.3 MiRank-B 
2.3.1 Patient Grouping 

We begin by dividing patients into a series of 21 groups, according to the following scheme. 





subffroun A 


fiubffrour) R 


group 1 


trienfh < 90 davs 3 mouths) 


tdenfh > 90 davs 3 months) 


group 2 


tdeath < 90 days 3 months) 


tdeath > 180 days (~ 6 months) 


group 3 


tdeath < 90 days (~ 3 months) 


tdeath > 365 days 12 months) 


group 4 


tdeath < 90 days 3 months) 


tdeath > 1095 days 36 months) 


group 5 


tdeath < 90 days 3 months) 


tdeath > 1825 days 60 months) 


group 6 


tdeath < 90 days (~ 3 months) 


tdeath > 3650 days 120 months) 


group 7 


tdeath < 180 days (~ 6 months) 


tdeath > 180 days (~ 6 months) 


group 8 


tdeath < 180 days (~ 6 months) 


tdeath > 365 days 12 months) 


group 9 


tdeath < 180 days (~ 6 months) 


tdeath > 1095 days (~ 36 months) 


group 10 


tdeath < 180 days (~ 6 months) 


tdeath > 1825 days 60 months) 


group 11 


tdeath < 180 days 6 months) 


tdeath > 3650 days 120 months) 


group 12 


tdeath < 365 days 12 months) 


tdeath > 365 days (^12 months) 


group 13 


tdeath < 365 days (~ 12 months) 


tdeath > 1095 days (~ 36 months) 


group 14 


tdeath < 365 days (~ 12 months) 


tdeath > 1825 days (~ 60 months) 


group 15 


tdeath < 365 days 12 months) 


tdeath > 3650 days 120 months) 


group 16 


tdeath < 1095 days 36 months) 


tdeath > 1095 days 36 months) 


group 17 


tdeath < 1095 days (~ 36 months) 


tdeath > 1825 days (~ 60 months) 


group 18 


tdeath < 1095 days (~ 36 months) 


tdeath > 3650 days 120 months) 


group 19 


tdeath < 1825 days 60 months) 


tdeath > 1825 days (~ 60 months) 


group 20 


tdeath < 1825 days (~ 60 months) 


tdeath > 3650 days (~ 120 months) 


group 21 


tdeath < 3650 days (~ 120 months) 


tdeath > 3650 days (~ 120 months) 



Figure 1.1.1 



where tdeath is equivalent to max(days_to_death,days_to_last_f ollowup). 
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2.3.2 Expression and Correlation Statistics 



For each group, we generate the following "expression statistics" across all genes/miRNAs: min(X), max{X.), 
medianCK), mean(X), and tT(X), where X is the set of expression values for a given gene/miRNA. This in- 
formation may later be used for verification/reference. 



Additionally, for each group we generate the following "expression-survival correlation statistics" across all 
genes/miRNAs: 0{A[JB), 0{B), 0{A), f{B,A), te{B,A), p{te), tu{B,A), andp(t„), where O is the fraction 
of patients in the parameter-specified subgroup whose expression value falls within the range of expression 
values in the subgroup they are not contained in, f{B, A) is the fold change of expression values from 
subgroup A to subgroup B, is the student's t-tcst statistic testing the null hypothesis that the distribution 
of expression values in subgroup A and subgroup B are equal (assuming equal variance) , p is the probability 
density function for the student's t-test distribution, and t„ is the student's t-test assuming unequal variance. 
O is defined as follows. 

where I (subset of subgroup Z) is the set of patients whose expression value falls within the range of expression 
values in the subgroup they are not contained in, and k is the set of patients in group k. f is defined as 
follows. 

( X 



Xe 



f{B,A) = { ""^^ ■ - - (9) 



where Xa is the mean expression- value in subgroup A (likewise for Xb and subgroup B). is defined as 
follows. 

te{B, A) = _ ,^ ,^ . . = (10) 



I (\B\-l)Var(XB) + (\A\--i.)Var(XA) / J_ , ^ 
V |S| + |A|-2 V 1^1 1^ 



L, is defined as follows. 



^1 



tu{B, A) = , ^b-Xa _ 

^ ' Var(XB) I Var(XA) ' 

V |B| ^ 1^1 



p is defined as follows. 



where v is the degrees of freedom of t, and r(n) is the Gamma function (Lanczos approximation of {n — 1)!). 
V is defined as follows. 



■VjMXbI I Var{XA) \2 
\B\ + \A\ 



t |B| ) _|_ \ \A\ ) 



\B\-\ -r \A\-1 
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Generate expression statistics for each subgroup cf 
each group 



Generate expression-survival correlation statistics for each 

group 



overlap (to I a]) <-> fraclion of patients with expression vaiues lliat lie within the range of expression vajues obsertfed m tlie 
subgroup they are not in, 

civerlap(bad) fraction ot patients in subgroup A cinat... 

overlap (good) <-> fraction of patients in subgroup B that... 

(old change <-> fold change ol expression values trom subgroup B to subgroup A 

student's t-t#st statistic <=> tests tfie nul|i Inypothesis tinat the distribution of expression values in subgroup A is idenlicaJ to 
the,. .in subgrnup B 

student's t-test p-value <-> probflbility density for student's t-test statistic 

(equal variance) <=> assumes equal variance in iViio distributions under consideration 

(unequal variar>ce) <-> assumes unequal variance... 



2.3.3 Recombination 

Now, we reorganize our data tables so that each subgroup's expression statistics may be compared side-by-side 
in the same spreadsheets. 



We perform a similar 'recombination' on our expression-survival correlation statistics, so that each statistic 
is isolated in its own table (and its change across groups is made clear). 
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'Recombine' 




correlation statistics 
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2.3.4 Expression-Survival Correlation Statistic M 

We propose the metric to measure the correlation of patient survival with the expression of the ith 
gene/miRNA. Mi is defined as follows. 

= (\fiB,,,A,,)\ ■ (1 - OMk U B,)) ■ (14) 

where f{Bik,Aik) is the fold change of expression values of the ith genc/miRNA from subgroup A of group k 
to subgroup B of group k, lb{B) returns the lower bound of survival (in days) in subgroup S, ub{A) returns 
the upper bound of survival (in days) in subgroup A, and Oi{Ak U Bk) is the overlap fraction of the «th 
gene/miRNA in the context of group k {1 < k < 21). 

2.4 MiRank-C 
2.4.1 Survival Analysis 

The survival function S{t) = Pr{T > t), where Pr is probability and T is the time of death, can be 
used to describe the survival of a population over time. Given the clinical information days_to_death and 
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days_to_last_f ollow_up for each patient in the TCGA data set the input TCGA data set, one can plot a 
Kaplan-Mcicr step function that approximates S{t) within specified intervals. Specifically, the KM curve is 
a plot of S{t), the maximum likelihood estimate of S{t), over time. S{t) is defined as 

Sit) = n (15) 

where rij is the number of survivors less the number of right-censored losses that occur in the input TCGA 
data set at time ti, and di is the number of deaths that occur in the input TCGA data set at time f,. Thus, 
S{t) is associated with the clinical information encapsulated by the input TCGA data set. 

S' (t) 

The hazard function X{t) = — describes the instantaneous density of events (deaths and right-censored 
losses) at time t. The cumulative hazard function A{t) = X{u) du = — ln[S{t)) describes the accumulated 
hazard from time to time t. Thus, A{t) is associated with S{t). 

2.4.2 Cox Proportional-Hazards Model 

The Cox Proportional-Hazards Model allows us to relate certain covariates, or explanatory variables, to the 
cumulative hazard function through the following parameterization 

A{t\X.)k = AQ{t)ke^''^''-''+^'^'-''+^^^'-''+-+^"^'^'' (16) 

where Ao{t)k is the baseline hazard function for the fcth patient, Xk is the vector of covariates for the A;th 
patient, and (3 is the set of regression coefficients. Xk is defined as the vector of expression covariates, where 
Xik is the expression misregulation metric of the ith gene/miRNA in the fcth patient. 

Undc^r this relation, covariate Xik has a multiplicative effect proportional to Pi on hazard. Thus, gene/miRNA 
expression misregulation is associated with patient survival. 



2.4.3 Cox Regression 

Our goal is to rank genes/miRNAs by the correlation of their expression misregulation with differential pa- 
tient survival. We accomplish this by estimating the values of the regression coefficients /3 from equation (16), 
which can then be used to compare the hazard ratios of genes/miRNAs. First, let us define the following 
functions. 

The partial likelihood function describes the probability of a parameter in an equation like (16) having a 
certain value given a set of observed data. Partial likelihood is defined as follows 

where t is a discrete unit of time (at which at least one event occurs), R is the set of patients still at risk 
at time t, X^k is the expression misregulation value for the nth gene/miRNA of the kih patient, X^ is the 
sum of the expression misregulation values of the patients dying at time t, Dk{t) is the probability of the fcth 
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patient's death at time t, and D{t) is the number of deaths that occur at time t. 



The log partial likelihood, which produces more computationally 'convenient' probability figures than the 
partial likelihood, is defined as follows 



Dit) ( - In ( Ee^'"=^"Dfc(t) 

\ VfeeR y 



(18) 



Our goal is to find $, the set of maximum partial likelihood estimates (MPLES) for /3. To do so, we must 
find the /3„-values that maximize their log partial likelihood functions IniPn)- 

We maximize ln{Pn) by approximating the roots of its first derivative using Newton's method. This gives us 
^„-values that correspond to the extreme points of Z„. The /3„-value that corresponds to the extreme point 
with the greatest Z„ value will be our MPLE. 



The partial score function is defined as follows. 

4(/3n)=E 
t 

The first derivative of the partial score function (second derivative of ln{Pn)) is defined as follows. 



""^'^ ' " E..Re--^''A(t) ) 



(19) 



EfeeR<^^"'=^"^fc(t) 



(20) 



Newton's method, also called the Newton-Raphson algorithm, for approximating the zeroes of a univariate 
function f{x) is described by the following recursive relation 



f{Xn) 

nxn) 



(21) 



Applying Newton's method to the approximation of the roots of the partial score function, we find that 



fc+1 — Pn,k 



l'{Pn,k) 



(22) 



where Pn,k is the fcth iteration on the value /3„. After applying this recursive relation to each value in /3, we 
are left with our set of MPLEs /3. 



2.4.4 Gradient Descent with Modification 

Although the convergence rate of Newton's method is fairly high, reducing the computational complexity 

of MiRank-C, the log partial likelihood function docs not conform to all the preconditions of the Newton- 
Raphson algorithm. Because the partial score function is discontinuous, Newton's method can sometimes 
encounter an x-value outside the likelihood function's continuous interval, causing the recursive relation to 
'diverge' permanently. For this reason, we must modify our approach to approximating the zero of the partial 
score function. 



We propose a 'gradient descent with modification' approach to approximating the zero of the partial score 
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function for /?„ values that 'escape' the hkehhood function's interval of continuity during an iteration of 
Newton's method. The algorithm is as follows on the next page. 



2.5 Pathway Profiling 



2.5.1 Consolidated Rank a 



To simplify our enrichment analyses, we propose the following 'consolidated' rank a. 



Z{0i) + Z{Mi) + Z{Ti) + Z{Xi) 



(23) 



4 



where Z is the normalization function defined as follows. 



Z{Yi) = 



(24) 



where Y is the mean of set Y, and cr(Y) is the standard deviation of set Y. 
2.5.2 KEGG Pathway Enrichment 

With our set of ranks a, we can now take gene classes defined in KEGG Pathways (i.e. Ubiquitin-mediated 
Proteolysis Pathway, p53 Signaling Pathway, etc.) and calculate Fisher's one-sided exact test statistic r for 
each. This figure ranks each pathway by the proportion of occurrences of "high" ranks in its gene/miRNA 
class against the occurences of "high" ranks in all genes being considered, r is defined as follows. 



where P is the set of genes implicated in the ith pathway, X is the vector of gcncs/miRNAs under consider- 
ation, Gp is the subset of P for which a„ > a, and Gx is the subset of X for which a„ > a. 

Ti now represents the enrichment of the ith pathway with highly ranked genes. 



m _ \Gp\ |x| 



(25) 



^ \Gx\ ■ |P.| 
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2.5.3 Clustering 

We cluster using a self-organizing map (SOM), a method that produces clusters that are more "stable" 
but less "compact" than those produced by k-means partitioning, partitioning around medoids, etc. The 
algorithm is as follows. 

1. Let Xj : {Pi,Mi,Ti,Xi} be the vector associated with the ith gene/miRNA. 

2. Randomly choose a sample vector x from the input data set. 

3. Find the gene/miRNA whose vector minimizes the following 

||x - mftll = min{||x - mill} (26) 

i 

where nij is the vector associated with the ith gene, b denotes the best-matching unit (BMU), and 
||a — b|| represents the Euclidean distance between a and b. 

4. Update gene/miRNA vectors (excepting the BMU) according to the following rule. 

niiit + 1) = mi(t) + a{t)hbi{t)[K - ini{t)] (27) 

where t is time, a is the learning rate, and hu is the neighborhood kernel centered on the BMU. a{t) 
is defined as follows. 



hbi is defined as follows, 
where a{t) is defined as follows. 



a(t) = - — ]—r^ (30) 
5. Repeat steps 2-4 until a < 0.25 
2.5.4 Partitioning 

After clustering with an SOM, we apply hierarchical agglomerative clustering to the resultant set of vectors, 
thus assigning each gene/miRNA to a discrete partition in a series of "levels." The algorithm is as follows. 

1. Place each sample vector from the input data set into its own singleton partition. 

2. Merge the two closest partitions. The distance between a partition A and partition B is defined as 
follows. 

' " ' ieA jeB 

3. Add the current set of partitions to level k, where k is the distance between the two closest partitions 
from step 2. 

4. Repeat steps 2 and 3 until all the data are merged into a single partition. 

After judicious analysis of pipeline outputs, one may select a set of partitions with an 'appropriate' magnitude 
of separation. 
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2.5.5 GO Term Enrichment 



After delineating partitions, we generate a spread of GO terms associated with the genes in each partition. 
Additionally, we show the relative frequencies of occurrences of GO terms within each partition. Gene:Term 
associations were found using data from ftp://ftp.ebi.ac.uk/pub/databases/GD/goa/HUMAN/. Term de- 
scriptions were also web-scraped from http : //www . ebi . ac . uk/QuickGO. 

3 Results 

3.1 Gene/miRNA Ranking 
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Top 20 Genes/miRNAs, by Rank 



3.2 Prognostic Index Performance Comparison 

The 260-gene prognostic index constructed from the outputs of our ranking pipeline achieved a hazard ratio 
of 1.15 (a 1 unit increase in a patient's prognostic index corresponds to a 1.15x increase in a patient's 'hazard', 
or risk of death). Yoshihara et. al. reported an HR of 1.62 for their genetic profile-based survival index 
(applied to Tothill's data set of serous ovarian cancer patients). 



17 



3.3 Pathway Profiling 
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Pathway Enrichment 



The foUowing gene-product wiring diagrams were taken from the KEGG Pathways database and annotated 
(coloured) by our scripts. 

4 Discussion 

4.1 Ontology Overview of Top-Ranking Pathways 

• The Wnt signahng pathway plays a significant role in cell-fate specification and progenitor-cell prolifer- 
ation, which is significantly upregulated in cancerous germ line cells that give rise to (ovarian) epithelial 
cells. 

• Activation of the VEGF signaling pathway leads to the upregulation of genes involved in mediating the 
proliferation and migration of endothelial cells and promoting their survival and vascular permeability, 
a key attribute of tumorigenic cells. 

• The MAPK signaling pathway regulates critical cellular functions like proliferation, differentiation, and 
migration: functions whose misregulation is associated with cancer. 

• The Notch signaling pathway encodes the processes necessary for intercellular signaling, which is nec- 
essary for tumorigenesis and cell migration (metastasis). 

• Apoptosis, or programmed cell death, is often suppressed once a cell undergoes carcinogenesis. 
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A gradient (passing through dark gray for values near the mean) separates low-ranking genes, in bright 

green, from high-ranking genes, in bright red. 
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4.2 Ontology of Top-Ranking Genes 
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Gene: Associated GO Terms (# occurrences of term in table) 



Highlight: Several genes in our top-20 set are ontologically involved in the canonical Wnt receptor signaling 
pathway (highly enriched, by r), which regulates key functions like cell-fate specification and progenitor-cell 
proliferation. 
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4.3 Oncogenic Targets of Top-Ranking miRNAs 
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miRNA: Gene Targets (Position on list, sorted by rank) 



4.4 Intra-pathway "Driver" Genes 

Functional groups of genes mentioned below are highly ranked within their respective pathways (and are 
coloured red in their annotated KEGG diagrams). 

Pathways in Cancer (KEGG 05200): (1) upregulation of TCF leads to evasion of apoptosis, programmed 
cell death; (2) upregulation of Wnt leads to excessive cell proliferation; (3) upregulation of VEGF, Jun and 
Fos leads to sustained angiogenesis, which provides nourishment for the growing tumor. 

Apoptosis (KEGG 04210): (1) downregulation of Cn leads to downregulation of one of the Ca2+-induced 
Cell Death Pathways involved in apoptosis; (2) downregulation of CASP8 leads to the downregulation of dna 
fragmentation and cell degradation. 

MAPK Signaling Pathway (KEGG O4OIO): (1) upregulation of Fos leads to excessive cell proliferation and 
differentiation (in germ lines); (2) upregulation of NFkB leads to cell proliferation, inflammation and anti- 
apoptosis. 



5 Conclusion 

The outputs of our analysis pipeline support our hypothesis that the expression-survival correlation scheme 
described in Section 2 can be used to rank and identify genes/miRNAs that play critical roles in ovarian 
carcinoma. 
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• Several top-ranking functional groups of genes (Wnt, VEGF, Jun, Fos, etc.) are already known to be 
involved in tumorigenesis. 

• Several miRNAs in our top-20 set (sorted by rank) have targets in the top two quartiles of genes. 

• The misregulation of highly-ranked genes in enriched pathways drives excessive cell proliferation, mi- 
gration, angiogenesis, etc., which are characteristic of ovarian carcinoma. 

Researchers now have at their disposal a scries of open-source scripts that can be used to screen TCGA and 
KEGG data sets for genes/miRNAs and pathways of 'interest' in a given cancer. 

The hazard ratio of our prognostic index was slightly below those reported by Yoshihara et. al. As such, 

to improve our experimental procedure, wc coiild closely examine our ranking scheme and perhaps expand 
its scope. As described in Section 2, our pipeline takes only expression data and clinical information as 
inputs. Integrating other genomic characterizations of ovarian cancer provided by TCGA, like copy-number 
aberrations and dna methylation irregularities, into MiRank-A/B/C could improve the performance of our 
prognostic index, and thus, the selectiveness of our gene/miRNA ranking. 

In examining the outputs of CBioKonnect (http://sourceforge.net/projects/cbiokonnect/), a data 
visualization tool wc developed last summer, wc found that several genes (including p53, a tumor suppres- 
sor that is typically downregulated in cancer) are heavily mutated in ovarian carcinoma, but seem to have 
'normal' expression levels according to TCGA and CBioPortal data. This is most likely because the oligonu- 
cleotide probes in most expression microarrays are not designed to measure the up/down- regulation of genes 
that occurs due to mutation. 

This masking of the 'triie' expression levels of genes like p53 highlights a deficiency in our analysis pipeline: 
a reliance on 'incomplete' expression data from TCGA. If we were to calculate the effective expression levels 
of genes/miRNAs rather than rely solely on TCGA microarray data, I'm sure we would find many more of 
them to be misregulated (and thus, more highly ranked). 

The results of our pathway-centric profiling of TCGA data sets lead us to wonder how cancer has evolved to 
target such a diverse range of genes and miRNAs, especially those that regulate critical cellular functions. If 
protein-protein interaction networks are as robust as most quantitative biology studies claim, how is it that a 
single disease can misregulate several pathways to such an extent that a cell becomes cancerous? Examining 
the topology of the network of genes/miRNAs targeted by various cancers and showing that it reflects 
an optimal selection of network nodes to misregulate (in order to accomplish excessive cell proliferation, 
migration, etc.) would highlight key elements of cancer's "network of misregulation" and allow researchers 
to design drugs that target genes/miRNAs that act as "drivers" of misregulated cellular activity. 
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